Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor unicode emoji parsing #20

Closed
wants to merge 10 commits into from

Conversation

freya022
Copy link
Contributor

Depends on #19

# Conflicts:
#	lib/src/jmh/java/benchmark/EmojiManagerAliasBenchmark.java
#	lib/src/main/java/net/fellbaum/jemoji/EmojiManager.java
@freya022 freya022 marked this pull request as ready for review November 16, 2023 13:24
@felldo
Copy link
Owner

felldo commented Nov 23, 2023

I also had the idea to do it in a way like you did, but didn't do it due to performance and future extension ability concerns.
To the extension ability concern:
Currently this refactor isn't really suited because of the planned index method for emojis. You return the codepoint index which most likely most people don't want. I had a prototype which returned both the codepoint index and char index and this wouldn't really be possible with this refactor as there is a smal performance bottleneck and this would be applied to every method. To get the string index I had to use a StringBuilder to keep track of the already processed characters and then get the length of it when it detects an emoji. This not only consumes a bit more memory but is also slower as its another step.

When it comes to performance, the refactor also seems to be worse with some methods. These were my benchmark results with the changed parameters to 10 iterations and 3 forks in a try to get more accurate results:

//Current code base
Benchmark                                                             Mode  Cnt   Score   Error  Units
EmojiManagerBenchmark.extractEmojisInOrder                            avgt   30   2,157 ± 0,049  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisLengthDescending  avgt   30   9,752 ± 0,028  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisRandomOrder       avgt   30  10,300 ± 0,046  ms/op
EmojiManagerBenchmark.removeAllEmojis                                 avgt   30   2,915 ± 0,044  ms/op
EmojiManagerBenchmark.replaceAllEmojis                                avgt   30   2,949 ± 0,121  ms/op
EmojiManagerBenchmark.replaceAllEmojisFunction                        avgt   30   2,849 ± 0,008  ms/op
//Refactor
Benchmark                                                             Mode  Cnt   Score   Error  Units
EmojiManagerBenchmark.extractEmojisInOrder                            avgt   30   4,209 ± 0,980  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisLengthDescending  avgt   30   9,807 ± 0,045  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisRandomOrder       avgt   30  10,355 ± 0,044  ms/op
EmojiManagerBenchmark.removeAllEmojis                                 avgt   30   2,962 ± 0,041  ms/op
EmojiManagerBenchmark.replaceAllEmojis                                avgt   30   3,076 ± 0,034  ms/op
EmojiManagerBenchmark.replaceAllEmojisFunction                        avgt   30   3,238 ± 0,106  ms/op

So currently I would tend to not merge this PR

@freya022
Copy link
Contributor Author

After discussion, the above benchmark is erroneous (look at the huge margin of error in extractEmojisInOrder)

My runs were as follow:
Master:
image

Refactor:
image

PR #22 will require getting the char index of the emoji, which isn't necessary for most operations, so, this PR will be closed for that reason

While String#offsetByCodePoints could have been used when an emoji is received, it must travel most of the string until the specified offsets, manually keeping track of the character index ourselves is likely cheaper in these cases

@freya022 freya022 closed this Nov 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants